Tafazzin is a mitochondrial acyltransferase responsible for remodeling cardiolipin, a signature phospholipid of the inner mitochondrial membrane. The protein carries out the final remodeling step that produces mature cardiolipin with four linoleoyl acyl chains, which contributes to the structure and function of mitochondria. Defects in tafazzin cause Barth syndrome, which is characterized by cardiomyopathy, skeletal muscle weakness, and low white blood cell counts. Predicted structures of human tafazzin include an alpha-helical transmembrane anchor, a catalytic site for acyltransferase activity, and a positively charged membrane-associated region (Hijikata et al., 2015). While the structure of tafazzin itself has not yet been studied in detail, studies of other acyltransferases have identified essential histidine and arginine residues, as well as a highly conserved 7 amino acid sequence in the catalytic site (Heath et al., 1998; Dircks et al., 1999).
Multiple sequence alignment was done as the first bioinformatics analysis. Tafazzin protein sequences from multiple species and other acyltransferase sequences from humans were aligned for comparison. This allows visualization of how conserved the tafazzin protein is compared to other species. Comparing the conserved sequences to other acyltransferase sequences will help identify if they have similar functional domains.
Homology modeling and structural bioinformatics form the second bioinformatics analysis. This method creates a 3D model of what the protein structure would look like. By comparing the structures of the different tafazzin orthologs, one can see whether similar sequences lead to similar structures, whether different sequences still lead to similar structures, and so on. Even if the sequence isn't conserved, the structure may be. Additionally, comparing the structure of tafazzin to other mitochondrial acyltransferases may reveal shared structural features that are important for acyltransferase activity.
Phylogenetic clustering is the first visualization analysis. This aids in visualizing how connected / related the protein sequences are, as well as where they branch off, using a tree diagram. Phylogenetic clustering may help explain which domains may be more conserved than others, so it gives a better idea of how they evolved, and what was conserved despite the evolution.
3D protein measurement is the second visualization analysis. 3D protein measurement helps determine which residues are in which domain or structure of the protein and what their Index of Hydrophobicity is. Distances and angles between atoms can also be calculated. By looking at the specific characteristics of each residue, one can draw conclusions about what type of residues are conserved in the structure, and what those residues contribute to the conserved structure.
The data for this analysis can be found in NCBI (https://www.ncbi.nlm.nih.gov/gene?Db=gene&Cmd=DetailsSearch&Term=6901, orthologs and homologs can be found by scrolling down to the "General Gene Information" tab and checking under "homology") as well as UniProt (https://www.uniprot.org/uniprot/?query=tafazzin&sort=score). Data was downloaded as fasta files from NCBI and compiled into one fasta file, and data from UniProt was downloaded as separate PDB files.
Bio is the import name of Biopython, a package that contains many modules used for biological analysis. Two of these modules are SeqIO and AlignIO: SeqIO reads and writes sequence records, and AlignIO reads and writes sequence alignments. Both accept a variety of file formats (such as fasta, clustal, or phylip) and load the contents into Biopython objects that can then be used and analyzed. Sequences can either be in separate files or all compiled into one. Another module is Seq, whose Seq class takes sequences in as strings rather than files. Seq objects can be transcribed, translated, and manipulated in other ways to mimic biological processes. More information on Biopython and the mentioned modules can be found here: https://biopython.org/wiki/Getting_Started, https://biopython.org/wiki/AlignIO, https://biopython.org/wiki/SeqIO, https://biopython.org/docs/1.75/api/Bio.Seq.html
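To make the idea of a sequence record concrete, here is a minimal standard-library sketch of reading FASTA text into (id, sequence) pairs. It is not how SeqIO is implemented: `parse_fasta` and the two short records are hypothetical names and made-up data used only for illustration, and in the notebook itself `SeqIO.parse` does this work and returns richer SeqRecord objects.

```python
from io import StringIO


def parse_fasta(handle):
    """Yield (record_id, sequence) pairs from FASTA-formatted text.

    Hypothetical helper for illustration only; the notebook uses
    Bio.SeqIO.parse instead.
    """
    record_id, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one
            if record_id is not None:
                yield record_id, "".join(chunks)
            # The ID is the first whitespace-delimited token after '>'
            record_id, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if record_id is not None:
        yield record_id, "".join(chunks)


# Toy input: two made-up records, the second wrapped over two lines
example = StringIO(">seqA demo\nMPLHVK\n>seqB demo\nMPLE\nVT\n")
records = list(parse_fasta(example))
print(records)
```

A sequence wrapped over several lines is joined back into one string, which is why a multi-line FASTA entry still comes out as a single record.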
os is a module in Python's standard library that allows the terminal/Jupyter Notebook to interact with the operating system. In other words, it allows you to work with files and directories on your computer: you can build and split file paths, check whether a file exists, and list or change directories from within Jupyter Notebook. More information on os can be found here: https://docs.python.org/3/library/os.html
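As a small illustration of the path handling that os provides, the sketch below derives an output file name from an input file name in the same way the alignment code in this notebook does. Only the file name `TAZ.fasta` comes from the notebook; the existence check is an optional extra added here.

```python
import os

input_file = "TAZ.fasta"

# splitext separates the base name from the extension
base, ext = os.path.splitext(input_file)

# Build the name of the padded output file from the base name
output_file = "{}_padded.fasta".format(base)
print(base, ext, output_file)

# Optional: check that the input exists before trying to parse it
print(os.path.exists(input_file))
```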
sys is a module in Python's standard library that allows access to certain variables and functions in the Python runtime environment. Among them are the standard output and standard error streams, sys.stdout and sys.stderr. Normal output, such as the text written by print, goes to sys.stdout, while error messages and tracebacks from exceptions go to sys.stderr. In this notebook, the names stdout and stderr also hold the text returned by the MAFFT command-line wrapper. More information on sys can be found here: https://www.geeksforgeeks.org/python-sys-module/
Bio.Align.Applications is a module in Biopython that contains many command-line wrappers. The wrappers allow Jupyter Notebook to run separately installed software that carries out multiple sequence alignments. One such wrapper is MafftCommandline, which is used in this project to run MAFFT. The module also provides wrappers for other aligners, such as MUSCLE and ClustalW. Each wrapper builds the command line for the corresponding software, runs it, and returns its output. More information on Bio.Align.Applications can be found here: https://biopython.org/docs/1.76/api/Bio.Align.Applications.html
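Calling a wrapper returns two strings, the external program's standard output and standard error. The sketch below shows the same pattern with the standard-library subprocess module, using a child Python process as a stand-in for an aligner so that it runs without MAFFT installed. The child command and its messages are invented for illustration. Newer Biopython releases deprecate these wrappers and recommend building the command line and calling it through subprocess, so check the version you have installed.

```python
import subprocess
import sys

# Stand-in for an external aligner: a child Python process that writes one
# line to standard output and one line to standard error.
child_code = (
    "import sys; "
    "print('>seq1'); "
    "print('progress message', file=sys.stderr)"
)

result = subprocess.run(
    [sys.executable, "-c", child_code],
    capture_output=True,  # collect both streams instead of printing them
    text=True,            # decode bytes to str
    check=True,           # raise CalledProcessError on a non-zero exit code
)

stdout, stderr = result.stdout, result.stderr
print("stdout:", stdout.strip())
print("stderr:", stderr.strip())
```

Keeping the two streams separate matters for aligners, because the alignment itself arrives on standard output while progress and warning messages arrive on standard error.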
tempfile is a package in Python that can create temporary files and directories. It is used when you don't want to create more files and clog up your space/data. Instead, tempfiles can be generated and deleted for the function that you are trying to carry out. Functions include creating named temporary files, secure temporary files, or spooled temporary files. More information on tempfile can be found here: https://docs.python.org/3/library/tempfile.html
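The following sketch shows the life cycle of a named temporary file: it is created, written, read back, and removed automatically when the `with` block ends. The FASTA text is made up. One caveat from the Python documentation: on some platforms, notably Windows, a named temporary file may not be reopenable by name while it is still open, which can matter when another function or program is handed `temp.name`.

```python
import os
import tempfile

fasta_text = ">seqA\nMPLHVK\n>seqB\nMPLEVT\n"

# mode="w+" opens the temporary file for text writing and reading
with tempfile.NamedTemporaryFile(mode="w+", suffix=".fasta") as temp:
    temp.write(fasta_text)
    temp.flush()   # push buffered text to disk so other readers can see it
    temp.seek(0)   # rewind before reading back through the same handle
    round_trip = temp.read()
    temp_path = temp.name
    existed_inside = os.path.exists(temp_path)

# Leaving the with-block closes the file, which also deletes it
exists_after = os.path.exists(temp_path)
print(round_trip == fasta_text, existed_inside, exists_after)
```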
Bio.PDB is a package in Biopython for structural bioinformatics. It can read PDB or mmCIF files and even retrieve structures from the Protein Data Bank directly. Additionally, it contains functions that allow for analysis of macromolecular structure. These include calculating distances or angles between atoms and superimposing two structures to compare how similar they are. More information on Bio.PDB can be found here: https://biopython.org/wiki/The_Biopython_Structural_Bioinformatics_FAQ
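The geometry behind the distance and angle calculations is simple enough to sketch with the standard library alone. The coordinates below are made up and are not taken from any PDB file, and `distance` and `angle_deg` are hypothetical helper names. In Bio.PDB itself, subtracting two Atom objects gives their distance, and a `calc_angle` function is available for vectors.

```python
import math


def distance(a, b):
    """Euclidean distance between two 3D points."""
    return math.dist(a, b)


def angle_deg(a, b, c):
    """Angle in degrees at vertex b formed by the points a-b-c."""
    v1 = [ai - bi for ai, bi in zip(a, b)]
    v2 = [ci - bi for ci, bi in zip(c, b)]
    dot = sum(x * y for x, y in zip(v1, v2))
    norm = math.hypot(*v1) * math.hypot(*v2)
    # Clamp to [-1, 1] to guard against floating-point drift
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))


# Made-up coordinates in angstroms
atom_a = (3.0, 4.0, 0.0)
atom_b = (0.0, 0.0, 0.0)
atom_c = (-4.0, 3.0, 0.0)

d_ab = distance(atom_a, atom_b)
angle_abc = angle_deg(atom_a, atom_b, atom_c)
print(d_ab, angle_abc)
```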
nglview is a Jupyter widget package, separate from Biopython, that allows for viewing and interacting with 3D protein structures. Once the structure is loaded, you can rotate, zoom in on, and move the protein around to study its structure. Additionally, you can edit how the protein is viewed. You can view it in cartoon format, change its color, view the hydrogens, and much more. You can also download and display an image after interacting with it. More information can be found here: https://github.com/nglviewer/nglview
Bio.Phylo.TreeConstruction allows us to do phylogenetic clustering. The module can build phylogenetic trees from an alignment using neighbor joining or the Unweighted Pair Group Method with Arithmetic Mean (UPGMA). It can calculate a distance matrix holding the pairwise distances between the proteins being analyzed. Additionally, the resulting tree can be saved in different file formats. More information on this module can be found here: https://biopython.org/wiki/Phylo
Bio.SeqUtils.ProtParam is a module that allows for protein sequence analysis. It can count the number and fraction of specific amino acids and calculate hydrophobicity, aromaticity, and much more. It can also calculate the isoelectric point or the charge of the protein at a given pH. More information can be found here: https://biopython.org/docs/1.76/api/Bio.SeqUtils.ProtParam.html
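As a worked example of what the hydrophobicity calculation does, the sketch below computes the grand average of hydropathy (GRAVY) by hand: the Kyte-Doolittle value of every residue is summed and divided by the sequence length. The residue counts are copied from the "Amino Acid Count" output reported for TAZ_HUMAN_FL later in this notebook. The hand calculation is only a cross-check, since `ProteinAnalysis.gravy()` is what the notebook actually calls.

```python
# Kyte-Doolittle hydropathy value for each amino acid
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Residue counts copied from the TAZ_HUMAN_FL output in this notebook
counts = {
    "A": 15, "C": 6, "D": 9, "E": 18, "F": 16, "G": 22, "H": 16,
    "I": 14, "K": 19, "L": 32, "M": 10, "N": 12, "P": 21, "Q": 9,
    "R": 15, "S": 12, "T": 13, "V": 19, "W": 9, "Y": 5,
}

length = sum(counts.values())
gravy = sum(KYTE_DOOLITTLE[aa] * n for aa, n in counts.items()) / length
print(length, gravy)
```

A negative GRAVY value means the sequence is slightly hydrophilic on average, and the result here agrees with the "Average Hydropathy" reported for the same record.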
from Bio import AlignIO
from Bio import SeqIO
from Bio import Seq
import os
import sys
from Bio.Align.Applications import MafftCommandline
import tempfile
from Bio.PDB import *
import nglview as nv
from Bio import Phylo
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor
from Bio.SeqUtils.ProtParam import ProteinAnalysis
Multiple sequence alignment is a bioinformatics technique that is used to compare three or more different DNA, RNA, or protein sequences to find similarities and maximum matching between them. Multiple sequence alignment can help with identifying structural or functional components of a novel protein and to trace evolutionary relationships between the sequences being analyzed.
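To show what "matching" means at the level of alignment columns, the sketch below counts identical, mismatched, and gapped columns for a pair of aligned sequences. The two strings are one 60-column block taken from the MAFFT output in this notebook for TAZ_HUMAN_FL and sp|Q16635|TAZ_HUMAN_exon5, and `column_stats` is a hypothetical helper name.

```python
def column_stats(aligned_a, aligned_b, gap="-"):
    """Count identical, mismatched, and gapped columns of a pairwise alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    identical = mismatched = gapped = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            gapped += 1
        elif a == b:
            identical += 1
        else:
            mismatched += 1
    return identical, mismatched, gapped


# One 60-column block of the pairwise MAFFT output shown in this notebook
full_length = "VCRGAEFFQAENEGKGVLDTGRHMPGAGKRREKGDGVYQKGMDFILEKLNHGDWVHIFPE"
exon5_record = "VCR" + "-" * 30 + "GDGVYQKGMDFILEKLNHGDWVHIFPE"

identical, mismatched, gapped = column_stats(full_length, exon5_record)
print(identical, mismatched, gapped)
```

In this block every column where both sequences have a residue is identical, and the only difference is a single 30-residue gap.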
In the code below, we read the sequences from a fasta file into a list of sequence records with SeqIO, pad the shorter sequences with '.' so that every record has the same length, and load the padded file as an alignment object with AlignIO. Padding only equalizes the lengths; it does not place gaps at homologous positions. We then create a named temporary file and loop over the sequences, aligning each one against the first record in the fasta file (the reference) using MafftCommandline.
#Define global variable input_file as the fasta file containing all the TAZ sequences
input_file = 'TAZ.fasta'
#Define global variable records as a SeqIO object containing the contents of the fasta file
records = SeqIO.parse(input_file, 'fasta')
#Convert into list
records = list(records)
#code check: are all the records loaded?
#print(records)
#Define global variable maxlen as the length of the longest sequence
maxlen = max(len(record.seq) for record in records)
#code check: is the max length correct?
#print(maxlen)
#loop over each sequence and pad it on the right with '.' if it is shorter than the longest sequence
#(padding only equalizes the lengths so AlignIO can read the file; it does not align residues)
for record in records:
    if len(record.seq) != maxlen:
        #Define local variable sequence as a string holding a single padded sequence from the fasta file
        sequence = str(record.seq).ljust(maxlen, '.')
        #Replace record.seq with a Seq object built from the padded string
        record.seq = Seq.Seq(sequence)
#code check: do all records now have the same length?
assert all(len(record.seq) == maxlen for record in records)
#Define global variable output_file as the name of the fasta file that will hold the padded sequences
output_file = '{}_padded.fasta'.format(os.path.splitext(input_file)[0])
with open(output_file, 'w') as f:
    SeqIO.write(records, f, 'fasta')
#define global variable alignment that creates an Alignment object containing the MSA
alignment = AlignIO.read(output_file, "fasta")
print(alignment)
Alignment with 11 rows and 828 columns
MPLHVKWPFPAVPPLTWTLASSVVMGLVGTYSCFWTKYMNHLTV...... TAZ_HUMAN_FL
MPLHVKWPFPAVPPLTWTLASSVVMGLVGTYSCFWTKYMNHLTV...... sp|Q16635|TAZ_HUMAN_exon5
MPLHVKWPFPAVPRLTWTLASSVVMGLVGTYSCFWTKYMNHLTV...... sp|Q91WF0|TAZ_MOUSE
MPLEVTWPFPQCPRLGWRISSRVVMGMVGSYSYLWTKYFNSLMV...... sp|F1QCP6|TAZ_DANRE
MFMVVCSNLRRPGHVGAASAARNINWLISEGYTPPIRAMARPYV...... sp|Q9V6G5|TAZ_DROME
MSFRDVLERGDEFLEAYPRRSPLWRFLSYSTSLLTFGVSKLLLF...... sp|Q06510|TAZ_YEAST
MPLHVKWPFPAVPPLTWTLASSVVMGLVGTYSCFWTKYMNHLTV...... sp|Q6IV84|TAZ_PANTR
MAITLEEAPWLGWLLVKALMRFAFMVVNNLVAIPSYICYVIILQ...... sp|Q92604|LGAT1_HUMAN
MDLAGLLKSQFLCHLVFCYVFIASGLIINTIQLFTLLLWPINKQ...... sp|Q9NRZ5|PLCD_HUMAN
MDESALTLGTIDVSYLPHSSEYSVGRCKHTSEEWGECGFRPTIF...VVL sp|Q9HCL2|GPAT1_HUMAN
MVACRAIGILSRFSAFRILRSRGYICRNFTGSSALLTRTHINYG...... sp|P40939|ECHA_HUMAN
#set global variables reference and count to 0 in order to loop the multiple sequence alignment
reference = 0
count = 0
#creating a named temporary file for the for loop
with tempfile.NamedTemporaryFile() as temp:
    #make a for loop that aligns every sequence in the fasta file against the first record (the reference)
    for record in SeqIO.parse(input_file, "fasta"):
        if count == 0:
            reference = record
        else:
            SeqIO.write([reference, record], temp.name, "fasta")
            mafft_cline = MafftCommandline(input=temp.name)
            stdout,stderr=mafft_cline()
            print(stdout)
        count += 1
        print('\n')
ID: TAZ_HUMAN_FL
Name: TAZ_HUMAN_FL
Description: TAZ_HUMAN_FL [Homo sapiens]
Number of features: 0
Seq('MPLHVKWPFPAVPPLTWTLASSVVMGLVGTYSCFWTKYMNHLTVHNREVLYELI...PGR')
ID: sp|Q16635|TAZ_HUMAN_exon5
Name: sp|Q16635|TAZ_HUMAN_exon5
Description: sp|Q16635|TAZ_HUMAN_exon5 Tafazzin OS=Homo sapiens OX=9606 GN=TAFAZZIN PE=1 SV=2
Number of features: 0
Seq('MPLHVKWPFPAVPPLTWTLASSVVMGLVGTYSCFWTKYMNHLTVHNREVLYELI...PGR')
>TAZ_HUMAN_FL [Homo sapiens]
MPLHVKWPFPAVPPLTWTLASSVVMGLVGTYSCFWTKYMNHLTVHNREVLYELIEKRGPA
TPLITVSNHQSCMDDPHLWGILKLRHIWNLKLMRWTPAAADICFTKELHSHFFSLGKCVP
VCRGAEFFQAENEGKGVLDTGRHMPGAGKRREKGDGVYQKGMDFILEKLNHGDWVHIFPE
GKVNMSSEFLRFKWGIGRLIAECHLNPIILPLWHVGMNDVLPNSPPYFPRFGQKITVLIG
KPFSALPVLERLRAENKSAVEMRKALTDFIQEEFQHLKTQAEQLHNHLQPGR
>sp|Q16635|TAZ_HUMAN_exon5 Tafazzin OS=Homo sapiens OX=9606 GN=TAFAZZIN PE=1 SV=2
MPLHVKWPFPAVPPLTWTLASSVVMGLVGTYSCFWTKYMNHLTVHNREVLYELIEKRGPA
TPLITVSNHQSCMDDPHLWGILKLRHIWNLKLMRWTPAAADICFTKELHSHFFSLGKCVP
VCR------------------------------GDGVYQKGMDFILEKLNHGDWVHIFPE
GKVNMSSEFLRFKWGIGRLIAECHLNPIILPLWHVGMNDVLPNSPPYFPRFGQKITVLIG
KPFSALPVLERLRAENKSAVEMRKALTDFIQEEFQHLKTQAEQLHNHLQPGR
ID: sp|Q91WF0|TAZ_MOUSE
Name: sp|Q91WF0|TAZ_MOUSE
Description: sp|Q91WF0|TAZ_MOUSE Tafazzin OS=Mus musculus OX=10090 GN=Tafazzin PE=2 SV=1
Number of features: 0
Seq('MPLHVKWPFPAVPRLTWTLASSVVMGLVGTYSCFWTKYMNHLTVHNKEVLYELI...PGR')
>TAZ_HUMAN_FL [Homo sapiens]
MPLHVKWPFPAVPPLTWTLASSVVMGLVGTYSCFWTKYMNHLTVHNREVLYELIEKRGPA
TPLITVSNHQSCMDDPHLWGILKLRHIWNLKLMRWTPAAADICFTKELHSHFFSLGKCVP
VCRGAEFFQAENEGKGVLDTGRHMPGAGKRREKGDGVYQKGMDFILEKLNHGDWVHIFPE
GKVNMSSEFLRFKWGIGRLIAECHLNPIILPLWHVGMNDVLPNSPPYFPRFGQKITVLIG
KPFSALPVLERLRAENKSAVEMRKALTDFIQEEFQHLKTQAEQLHNHLQPGR
>sp|Q91WF0|TAZ_MOUSE Tafazzin OS=Mus musculus OX=10090 GN=Tafazzin PE=2 SV=1
MPLHVKWPFPAVPRLTWTLASSVVMGLVGTYSCFWTKYMNHLTVHNKEVLYELIENRGPA
TPLITVSNHQSCMDDPHLWGILKLRHIWNLKLMRWTPAAADICFTKELHSHFFSLGKCVP
VCR------------------------------GDGVYQKGMDFILEKLNHGDWVHIFPE
GKVNMSSEFLRFKWGIGRLIAECHLNPIILPLWHVGMNDVLPNSPPYFPRFGQKITVLIG
KPFSTLPVLERLRAENKSAVEMRKALTDFIQEEFQRLKMQAEQLHNHFQPGR
ID: sp|F1QCP6|TAZ_DANRE
Name: sp|F1QCP6|TAZ_DANRE
Description: sp|F1QCP6|TAZ_DANRE Tafazzin OS=Danio rerio OX=7955 GN=tafazzin PE=2 SV=1
Number of features: 0
Seq('MPLEVTWPFPQCPRLGWRISSRVVMGMVGSYSYLWTKYFNSLMVHNQDVLLNLV...NHT')
>TAZ_HUMAN_FL [Homo sapiens]
MPLHVKWPFPAVPPLTWTLASSVVMGLVGTYSCFWTKYMNHLTVHNREVLYELIEKRGPA
TPLITVSNHQSCMDDPHLWGILKLRHIWNLKLMRWTPAAADICFTKELHSHFFSLGKCVP
VCRGAEFFQAENEGKGVLDTGRHMPGAGKRREKGDGVYQKGMDFILEKLNHGDWVHIFPE
GKVNMSSEFLRFKWGIGRLIAECHLNPIILPLWHVGMNDVLPNSPPYFPRFGQKITVLIG
KPFSALPVLERLRAENKSAVEMRKALTDFIQEEFQHLKTQAEQLHNHLQPGR
>sp|F1QCP6|TAZ_DANRE Tafazzin OS=Danio rerio OX=7955 GN=tafazzin PE=2 SV=1
MPLEVTWPFPQCPRLGWRISSRVVMGMVGSYSYLWTKYFNSLMVHNQDVLLNLVDERPQD
TPLITVCNHQSCMDDPHIWGVLKFRQLWNLNKMRWTPTASDICFTREFHSSFFSRGKCVP
VVR------------------------------GDGVYQKGMDFLLERLNQGEWIHIFPE
GRVNMSGEFMRIKWGIGRLIAECSLHPIILPMWHIGMNDVLPNETPYIPRVGQRITVLVG
KPFTVRHLVNALRAENTNPTEMRKTVTDYIQDEFRSLKAQAEALHHRLQNHT
ID: sp|Q9V6G5|TAZ_DROME
Name: sp|Q9V6G5|TAZ_DROME
Description: sp|Q9V6G5|TAZ_DROME Tafazzin OS=Drosophila melanogaster OX=7227 GN=Taz PE=1 SV=2
Number of features: 0
Seq('MFMVVCSNLRRPGHVGAASAARNINWLISEGYTPPIRAMARPYVQAPEARPVPD...ERN')
>TAZ_HUMAN_FL [Homo sapiens]
------------------------------------------------------------
---------------------------------------------------------MPL
HVKWPFPAV--PPLTWTLASSVVMGLVGTYSCFWTKYMNHLTVHNREVLYELIEKRGPAT
PLITVSNHQSCMDDPHLWGILKLRHIWNLKLMRWTPAAADICFTKELHSHFFSLGKCVPV
CRGAEFFQAENEGKGVLDTGRHMPGAGKRREKGDGVYQKGMDFILEKLNHGDWVHIFPEG
KVNMSSEFLRFKWGIGRLIAECHLNPIILPLWHVGMNDVLPNSPPYFPRFGQKITVLIGK
PFSALPVLERLRAENKSAVEMRKALTDFIQEEFQHLKTQAEQLHNHLQPGR
>sp|Q9V6G5|TAZ_DROME Tafazzin OS=Drosophila melanogaster OX=7227 GN=Taz PE=1 SV=2
MFMVVCSNLRRPGHVGAASAARNINWLISEGYTPPIRAMARPYVQAPEARPVPDERYPGS
QQDRKDIATQTVRSSKPKDLRPPSPPTPSQTLNSSSLPPPMSDQDADPSLDVPTGVAMPY
NIDWIFPRLRNPSKFWYVVSQFVVSAVGIFSKVVLMFLNKPRVYNRERLIQLITKRPKGI
PLVTVSNHYSCFDDPGLWGCLPLGIVCNTYKIRWSMAAHDICFTNKLHSLFFMFGKCIPV
VRGI------------------------------GVYQDAINLCIEKAALGHWIHVFPEG
KVNMDKEELRLKWGVGRIIYESPKIPIILPMWHEGMDDLLPNVEPYVIQRGKQVTLNVGQ
PLDLNDFILDLKKRQVPEPTARKLITDKIQEAFRDLRAETEKLHRERN---
ID: sp|Q06510|TAZ_YEAST
Name: sp|Q06510|TAZ_YEAST
Description: sp|Q06510|TAZ_YEAST Tafazzin OS=Saccharomyces cerevisiae (strain ATCC 204508 / S288c) OX=559292 GN=TAZ1 PE=1 SV=1
Number of features: 0
Seq('MSFRDVLERGDEFLEAYPRRSPLWRFLSYSTSLLTFGVSKLLLFTCYNVKLNGF...KDD')
>TAZ_HUMAN_FL [Homo sapiens]
MPL--------HVKWPFPAVPPLTWTLASSVVMGLVGT-----YSCFWTKYMNHLTVHNR
EVLYELIEK-----RGPATPLITVSNHQSCMDDPHLWGILKLRHIWNLKLMRWTPAAADI
CFTKELHSHFFSLGKCVPVCRGAEFFQAENEGKGVLDTGRHMPGAGKRREKGDGVYQKGM
DFILEKLNHGD------------------------------WVHIFPEGKV-----NMSS
EFLRFKWGIGRLIAECHLNPIILPLWHVGMNDVLPNS------PPYFPR-FGQKITVLIG
KPFSALPVLERLRAENKSAVEMR------KALTDFIQ--EEFQHLKTQ-AEQLHNHLQPG
R-----------------------------------------------------------
-
>sp|Q06510|TAZ_YEAST Tafazzin OS=Saccharomyces cerevisiae (strain ATCC 204508 / S288c) OX=559292 GN=TAZ1 PE=1 SV=1
MSFRDVLERGDEFLEAYPRRSPLWRFLSYSTSLLTFGVSKLLLFTCYNVK------LNGF
EKLETALERSKRENRG----LMTVMNHMSMVDDPLVWATLPYKLFTSLDNIRWSLGAHNI
CFQNKFLANFFSLGQ-------------------VLSTERF----------GVGPFQGSI
DASIRLLSPDDTLDLEWTPHSEVSSSLKKAYSPPIIRSKPSWVHVYPEGFVLQLYPPFEN
SMRYFKWGITRMILEATKPPIVVPIFATGFEKIASEAVTDSMFRQILPRNFGSEINVTIG
DPLND-DLIDRYRKEWTHLVEKYYDPKNPNDLSDELKYGKEAQDLRSRLAAELRAHVAEI
RNEVRKLPREDPRFKSPSWWKRFNTTEGKSDPDVKVIGENWAIRRMQKFLPPEGKPKGKD
D
ID: sp|Q6IV84|TAZ_PANTR
Name: sp|Q6IV84|TAZ_PANTR
Description: sp|Q6IV84|TAZ_PANTR Tafazzin OS=Pan troglodytes OX=9598 GN=TAFAZZIN PE=2 SV=2
Number of features: 0
Seq('MPLHVKWPFPAVPPLTWTLASSVVMGLVGTYSCFWTKYMNHLTVHNKEVLYELI...PGR')
>TAZ_HUMAN_FL [Homo sapiens]
MPLHVKWPFPAVPPLTWTLASSVVMGLVGTYSCFWTKYMNHLTVHNREVLYELIEKRGPA
TPLITVSNHQSCMDDPHLWGILKLRHIWNLKLMRWTPAAADICFTKELHSHFFSLGKCVP
VCRGAEFFQAENEGKGVLDTGRHMPGAGKRREKGDGVYQKGMDFILEKLNHGDWVHIFPE
GKVNMSSEFLRFKWGIGRLIAECHLNPIILPLWHVGMNDVLPNSPPYFPRFGQKITVLIG
KPFSALPVLERLRAENKSAVEMRKALTDFIQEEFQHLKTQAEQLHNHLQPGR
>sp|Q6IV84|TAZ_PANTR Tafazzin OS=Pan troglodytes OX=9598 GN=TAFAZZIN PE=2 SV=2
MPLHVKWPFPAVPPLTWTLASSVVMGLVGTYSCFWTKYMNHLTVHNKEVLYELIENRGPA
TPLITVSNHQSCMDDPHLWGILKLRHIWNLKLMRWTPAAADICFTKELHSHFFSLGKCVP
VCR------------------------------GDGVYQKGMDFILEKLNHGDWVHIFPE
GKVNMSSEFLRFKWGIGRLIAECHLNPIILPLWHVGMNDVLPNSPPYFPRFGQKITVLIG
KPFSALPVLERLRAENKSAVEMRKALTDFIQEEFQHLKTQAEQLHNHLQPGR
ID: sp|Q92604|LGAT1_HUMAN
Name: sp|Q92604|LGAT1_HUMAN
Description: sp|Q92604|LGAT1_HUMAN Acyl-CoA:lysophosphatidylglycerol acyltransferase 1 OS=Homo sapiens OX=9606 GN=LPGAT1 PE=1 SV=1
Number of features: 0
Seq('MAITLEEAPWLGWLLVKALMRFAFMVVNNLVAIPSYICYVIILQPLRVLDSKRF...CLF')
>TAZ_HUMAN_FL [Homo sapiens]
MPLHVKWPFPAVPPLTWTLASSVVMGLVGTYSCFWTKYMNHLTVHNREVLYELIEKRGPA
TPLITVSNHQSCMDDPHLWGILKLRHIWNLKL------------MRWTP-----------
------AAADICFTKE--------------LHSHFFSLGK--CVPVCRGAEFFQAENEGK
GVLDT-----GRHMPGAGKRREKGDGVYQKGMDFILEKLNHGDWVHIFPEG-----KVNM
SSEFLR------------------------------------------------FKWGIG
RLIAECHLNPIILPLWHVGMNDVLPNSPPY--FP-----------------RFGQKITVL
IGKPF---SALPVLERLRAENKSAVEMRKALTD---FIQEEFQHLKTQAEQLHNHLQPGR
----
>sp|Q92604|LGAT1_HUMAN Acyl-CoA:lysophosphatidylglycerol acyltransferase 1 OS=Homo sapiens OX=9606 GN=LPGAT1 PE=1 SV=1
MAITLE----EAPWLGWLLVKALMR--------FAFMVVNNLVAIPSYICYVIILQ----
-PL-------RVLDSKRFWYIEGIMYKWLLGMVASWGWYAGYTVMEWGEDIKAVSKDEAV
MLVNHQATGDVCTLMMCLQDKGLVVAQMMWLMDHIFKYTNFGIVSLVHGDFFIR---QGR
SYRDQQLLLLKKHLENNYRSRDR-------------------KWIVLFPEGGFLRKRRET
SQAFAKKNNLPFLTNVTLPRSGATKIILNALVAQQKNGSPAGGDAKELDSKSKGLQWIID
TTIAYPKAEPIDIQTWILGYRKPTVTHVHYRIFPIKDVPLETDDLTTWLYQRFVEKEDLL
--SHFYETGAFPP----SKGHKEAVSREMTLSNLWIFLIQSFAFL--SGYMWYNIIQYFY
HCLF
ID: sp|Q9NRZ5|PLCD_HUMAN
Name: sp|Q9NRZ5|PLCD_HUMAN
Description: sp|Q9NRZ5|PLCD_HUMAN 1-acyl-sn-glycerol-3-phosphate acyltransferase delta OS=Homo sapiens OX=9606 GN=AGPAT4 PE=1 SV=1
Number of features: 0
Seq('MDLAGLLKSQFLCHLVFCYVFIASGLIINTIQLFTLLLWPINKQLFRKINCRLS...LND')
>TAZ_HUMAN_FL [Homo sapiens]
MPLH--------------------------------VKWP-----FPAVP-PLTWTLASS
VVMGL---VGTYSCFWTKYMNHLTVHNREVLYELIEKRGPATPLITVSNHQSCMDDPHLW
------GILKLRHIWNLKLMRWTPAAADICFTKEL-------------------------
HSHFFSLGKCVPVCRGAEFFQAENE-------GKGVLDTGRH-MPGAGKRREKG------
------DGVYQKGMDF-------ILEKLN----HGD--------------------WVHI
FPEGKVNMSSEFLRFKWGIGRLIAECHLNPIILPLWHVGMNDVLPNSPPYFPRFGQKITV
LIGKPFSAL-----------PVLERLRAENKSAVEMRKALTDFIQEEF------------
-------QHLKTQAEQLHNHLQPGR
>sp|Q9NRZ5|PLCD_HUMAN 1-acyl-sn-glycerol-3-phosphate acyltransferase delta OS=Homo sapiens OX=9606 GN=AGPAT4 PE=1 SV=1
MDLAGLLKSQFLCHLVFCYVFIASGLIINTIQLFTLLLWPINKQLFRKINCRLSYCISSQ
LVMLLEWWSGTECTIFTDPRAYL-------------KYGKENAIV-VLNHKFEIDFLCGW
SLSERFGLLGGSKVLAKKELAYVPIIGWMWYFTEMVFCSRKWEQDRKTVATSLQHLRDYP
EKYFFLIH-----CEGTRFTEKKHEISMQVARAKGLPRLKHHLLP-----RTKGFAITVR
SLRNVVSAVYDCTLNFRNNENPTLLGVLNGKKYHADLYVRRIPLEDIPEDDDECSAWLHK
LYQEKDAFQEEYYR--------------------------TGTFPETPMVPPR-------
---RPWTLVNWLFWASLVLYPFFQFLVSMIRSGSSL--TLASFILVFFVASVGVRWMIGV
TEIDKGSAYGNSDSKQKLND-----
ID: sp|Q9HCL2|GPAT1_HUMAN
Name: sp|Q9HCL2|GPAT1_HUMAN
Description: sp|Q9HCL2|GPAT1_HUMAN Glycerol-3-phosphate acyltransferase 1, mitochondrial OS=Homo sapiens OX=9606 GN=GPAM PE=1 SV=3
Number of features: 0
Seq('MDESALTLGTIDVSYLPHSSEYSVGRCKHTSEEWGECGFRPTIFRSATLKWKES...VVL')
>TAZ_HUMAN_FL [Homo sapiens]
---------------MPLH------------------------------VKW--------
-PF------------------PAVPPL----------------TW---------------
-------------TLASSVVMGLVGTYSCFW---------TKYMNHLTVHNREVLYELIE
KRGPAT-------------------------------------PLITVSNHQSCMD----
----------DPHLWG------------ILKLRHIWNLKLMRWTP-AAADICFTKELHSH
FFSL------------------GK--C-----------------------VPV-------
-----------------------------------CRGAEFFQ-----------------
--------------------AENEGKGV-LDTGRHMPGAGKRRE--------------KG
DGV-------------YQKGMD--------FILEK-------------------------
------LNH---GDWVHIFPEGKVNMSSEFLRF------------------------KWG
IG-------RLIAE----------CHL----NPIILPLW---------------------
--HVGMNDVLPN----------SPPYFPR---------FGQK------------------
--ITVLIGKPFSAL------------PV-----LERL------RAEN-------------
-KSAVEMRKALTDFIQEEFQHLKTQAEQLHNHLQP--GR-------------
>sp|Q9HCL2|GPAT1_HUMAN Glycerol-3-phosphate acyltransferase 1, mitochondrial OS=Homo sapiens OX=9606 GN=GPAM PE=1 SV=3
MDESALTLGTIDVSYLP-HSSEYSVGRCKHTSEEWGECGFRPTIFRSATLKWKESLMSRK
RPFVGRCCYSCTPQSWDKFFNPSIPSLGLRNVIYINETHTRHRGWLARRLSYVLFIQERD
VHKGMFATNVTENVLNSSRVQEAIAEVAAELNPDGSAQQQSKAVNKVKKKAKRILQEMVA
TVSPAMIRLTGWVLLKLFNSFFWNIQIHKGQLEMVKAATETNLPLLFLPVHRSHIDYLLL
TFILFCHNIKAPYIASGNNLNIPIFSTLIHKLGGFFIRRRLDETPDGRKDVLYRALLHGH
IVELLRQQQFLEIFLEGTRSRSGKTSCARAGLLSVVVDTLSTNVIPDILIIPVGISYDRI
IEGHYNGEQLGKPKKNESLWSVARGVIRMLRKNYGCVRVDFAQPFSLKEYLESQSQKPVS
ALLSLEQALLPAILPSRPSDAADEGRDTSINESRNATDESLRRRLIANLAEHILFTASKS
CAIMSTHIVACLLLYRHRQGIDLSTLVEDFFVMKEEVLARDFDLGFSGNSEDVVMHAIQL
LGNCVTITHTSRNDEFFITPSTTVPSVFE-LNFYSNGVLHVFIMEAIIACSLYAVLNKRG
LGGPTSTPPNLISQEQLVRKAASLCYLLSNEGTISLPCQTFYQVCHETVGKFIQYGILTV
AEHDDQEDISPSLAEQQWDKKLPEPLSWRSDEEDEDSDFGEEQRDCYLKVSQSKEHQQFI
TFLQRLLGPLLEAYSSAAIFVHNFSGPVPEPEYLQKLHKYLITRTERNVAVYAESATYCL
VKNAVKMFKDIGVF--KETKQKRVSVLELSSTFLPQCNRQKLLEYILSFVVL
ID: sp|P40939|ECHA_HUMAN
Name: sp|P40939|ECHA_HUMAN
Description: sp|P40939|ECHA_HUMAN Trifunctional enzyme subunit alpha, mitochondrial OS=Homo sapiens OX=9606 GN=HADHA PE=1 SV=2
Number of features: 0
Seq('MVACRAIGILSRFSAFRILRSRGYICRNFTGSSALLTRTHINYGVKGDVAVVRI...FYQ')
>TAZ_HUMAN_FL [Homo sapiens]
------------------------------------MPLHVKW-------------PFPA
VPPLTWTLAS--SVVMG------------LVGTY-SCFWT----------KYMNHLTVHN
REVLYELIEKRGPATPLITVSNHQSCMDD-------------------------------
PHLWGILKLRHIWNLKLMRWTPAAADICFT------------------------------
----------------------------KELHSHFFSLGKCVPVCRGAEFFQAE----NE
GKG----------VLDTG------------------------------------------
--------RHMP--GAGKRREKGDGVYQ----KGM-----DFILEKLNHG----------
-----------------------DW-----------------------------------
--------------------------VHIFP-----------------------------
-EGKVNMSSEFLRFKWGIGRLIAECHLNPIILPLWHVGMNDVLP------NSPPYFP---
-------------------------RFG----QKITVLIGKPFSAL--------------
---------PVLERLRAENKSAVEMR-----KALTDFIQEEFQHLK--------------
----------------------------------------TQAEQLHNHLQ-PGR----
>sp|P40939|ECHA_HUMAN Trifunctional enzyme subunit alpha, mitochondrial OS=Homo sapiens OX=9606 GN=HADHA PE=1 SV=2
MVACRAIGILSRFSAFRILRSRGYICRNFTGSSALLTRTHINYGVKGDVAVVRINSPNSK
VNTLSKELHSEFSEVMNEIWASDQIRSAVLISSKPGCFIAGADINMLAACKTLQEVTQLS
QEA-QRIVEKLEKSTKPIVAAINGSCLGGGLEVAISCQYRIATKDRKTVLGTPEVLLGAL
PGAGGTQRLPK------MVGVPAALDMMLTGRSIRADRAKKMGLVDQLVEPLGPGLKPPE
ERTIEYLEEVAITFAKGLADKKISPKRDKGLVEKLTAYAMTIPFVRQQVYKKVEEKVRKQ
TKGLYPAPLKIIDVVKTGIEQGSDAGYLCESQKFGELVMTKESKALMGLYHGQVLCKKNK
FGAPQKDVKHLAILGAGL---MGAGIAQVSVDKGLKTILKDATLTALDRGQQQVFKGLND
KVKKKALTSFERDSIFSNLTGQLDYQGFEKADMVIEAVFEDLSLKHRVLKEVEAVIPDHC
IFASNTSALPISEIAAVSKRPEKVIGMHYFSPVDKMQLLEIITTEKTSKDTSASAVAVGL
KQGKV-----IIVVKDGPGFYTTRC-LAPMMSEVIRILQEGVDPKKLDSLTTSFGFPVGA
ATLVDEVGVDVAKHVAEDLGKVFGERFGGGNPELLTQMVSKGFLGRKSGKGFYIYQEGVK
RKDLNSDMDSILASLKLPPKSEVSSDEDIQFRLVTRFVNEAVMCLQEGILATPAEGDIGA
VFGLGFPPCLGGPFRFVDLYGAQKIVDRLKKYEAAYGKQFTPCQLLADHANSPNKKFYQ
Homology modeling is a bioinformatics method that constructs a 3D protein structure using the sequences and structures of similar proteins. When proteins have over 20% identity in amino acids, their amino acid sequences can be used to compare and construct a model for a protein that otherwise may not have a predicted structure. Structural bioinformatics allows one to view and explore these 3D protein structures as well as manipulate how these proteins are displayed.
In the code below, information from pdb files was used to generate 3D protein structures of different tafazzin orthologs as well as other mitochondrial acyltransferases. The pdb files contain atomic information for each of the displayed proteins.
#define global variable parser to convert PDB files in a format the program can understand
parser = PDBParser()
#define global variable structure1 that will contain information in TAZ_human.pdb
#it must be used with the parser.get_structure function for the program to analyze and output the structure
structure1 = parser.get_structure("TAZ_human", "TAZ_human.pdb")
#define global variable TAZ_human that contains an NGLWidget allowing us to interact with the 3D protein structure
TAZ_human = nv.show_biopython(structure1)
TAZ_human
#creating and displaying an image that crops the structure and has a higher resolution
TAZ_human.render_image(trim=True, factor=6)
TAZ_human._display_image()
#the subsequent code is the same as the chunk above, just using different pdb files for different proteins
#comments will be the same, just defining global variables for different files or widgets for different proteins
structure2 = parser.get_structure("TAZ_mouse", "TAZ_mouse.pdb")
TAZ_mouse = nv.show_biopython(structure2)
TAZ_mouse
TAZ_mouse.render_image(trim=True, factor=6)
TAZ_mouse._display_image()
structure3 = parser.get_structure("TAZ_chimp", "TAZ_chimp.pdb")
TAZ_chimp = nv.show_biopython(structure3)
TAZ_chimp
TAZ_chimp.render_image(trim=True, factor=6)
TAZ_chimp._display_image()
structure4 = parser.get_structure("TAZ_fly", "TAZ_fly.pdb")
TAZ_fly = nv.show_biopython(structure4)
TAZ_fly
TAZ_fly.render_image(trim=True, factor=6)
TAZ_fly._display_image()
structure5 = parser.get_structure("TAZ_yeast", "TAZ_yeast.pdb")
TAZ_yeast = nv.show_biopython(structure5)
TAZ_yeast
TAZ_yeast.render_image(trim=True, factor=6)
TAZ_yeast._display_image()
structure6 = parser.get_structure("TAZ_zebrafish", "TAZ_zebrafish.pdb")
TAZ_zebrafish = nv.show_biopython(structure6)
TAZ_zebrafish
TAZ_zebrafish.render_image(trim=True, factor=6)
TAZ_zebrafish._display_image()
structure7 = parser.get_structure("GPAT1", "GPAT1.pdb")
GPAT1 = nv.show_biopython(structure7)
GPAT1
GPAT1.render_image(trim=True, factor=6)
GPAT1._display_image()
structure8 = parser.get_structure("hTFPa", "hTFPa.pdb")
hTFPa = nv.show_biopython(structure8)
hTFPa
hTFPa.render_image(trim=True, factor=6)
hTFPa._display_image()
structure9 = parser.get_structure("LGPAT1", "LGPAT1.pdb")
LGPAT1 = nv.show_biopython(structure9)
LGPAT1
LGPAT1.render_image(trim=True, factor=6)
LGPAT1._display_image()
structure10 = parser.get_structure("PLCD", "PLCD.pdb")
PLCD = nv.show_biopython(structure10)
PLCD
PLCD.render_image(trim=True, factor=6)
PLCD._display_image()
Phylogenetic clustering is an analysis method that compares the sequences of a protein across a group of subjects to see how closely related they are. In the code below, Biopython calculates the pairwise distances between the protein sequences in the alignment built from the fasta file containing the different tafazzin sequences. Then, these calculated values are used to construct a phylogenetic tree that visually depicts how related the chosen proteins are and where they diverged.
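The distance step can be sketched without Biopython. Below, the distance between two equal-length aligned sequences is taken as one minus the fraction of columns that hold the same character. This is a simplified stand-in for an identity-based distance, and Biopython's own calculator may treat gap characters differently. The four short sequences and the name `identity_distance` are made up for illustration.

```python
from itertools import combinations


def identity_distance(seq_a, seq_b):
    """Return 1 minus the fraction of columns holding the same character."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 1 - matches / len(seq_a)


# Toy aligned sequences of equal length (made up for illustration)
aligned = {
    "seqA": "MPLHVKWPFP",
    "seqB": "MPLHVKWPFP",
    "seqC": "MPLEVTWPFP",
    "seqD": "MSFRDVLERG",
}

# Pairwise distances keyed by the pair of sequence names
distances = {
    (a, b): identity_distance(aligned[a], aligned[b])
    for a, b in combinations(aligned, 2)
}
for pair, d in distances.items():
    print(pair, round(d, 2))
```

Tree-building methods such as neighbor joining start from exactly this kind of table, repeatedly joining the closest pair, so identical sequences end up as neighbors and the most divergent sequence sits on the longest branch.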
#set global variable calculator that contains an object for the distance between the identities (sequences)
calculator = DistanceCalculator('identity')
#set global variable distMatrix that contains the distances between each sequence
distMatrix = calculator.get_distance(alignment)
#code check: is the distance being calculated for all sequences?
#print(distMatrix)
#set global variable constructor that contains an object for the tree
constructor = DistanceTreeConstructor(calculator)
#set global variable TAZTree that contains the information needed to build the phylogenetic tree
TAZTree = constructor.nj(distMatrix)
#code check: is the cladogram and branches being calculated for all sequences?
#print(TAZTree)
#save information in a file TAZTree.xml
Phylo.write(TAZTree, 'TAZTree.xml', 'phyloxml')
1
#Define global variable fig that contains the phylogenetic tree for TAZ
#I kept this display since it also displays the taxa and branch length
fig = Phylo.draw(TAZTree)
#Redefine global variable fig to format it in a way that is more easily read
fig = Phylo.draw_ascii(TAZTree)
        _______ TAZ_HUMAN_FL
  _____|
 |     |    ___ sp|F1QCP6|TAZ_DANRE
 |     |___|
 |         |  , sp|Q91WF0|TAZ_MOUSE
 |         |__|
 |            , sp|Q6IV84|TAZ_PANTR
 |            |
 |            | sp|Q16635|TAZ_HUMAN_exon5
 |
 |_____________ sp|Q9V6G5|TAZ_DROME
 |
_|                ___________________________ sp|P40939|ECHA_HUMAN
 |_______________|
 |               |_________________________________ sp|Q9HCL2|GPAT1_HUMAN
 |
 |_____________ sp|Q06510|TAZ_YEAST
 |
 |_____________ sp|Q9NRZ5|PLCD_HUMAN
 |
 |____________ sp|Q92604|LGAT1_HUMAN
3D protein measurement quantifies properties of the chosen proteins. Here, we calculate from each sequence the molecular weight, average hydropathy, number and fraction of histidine/aspartate residues, fraction of residues associated with each secondary structure, aromaticity, instability index, isoelectric point, and molar extinction coefficient.
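Two of these quantities are easy to verify by hand, which also clarifies their units. Aromaticity is the combined fraction of phenylalanine, tryptophan, and tyrosine residues, and the "Percent" values printed in the output are fractions between 0 and 1 rather than percentages. The counts below are copied from the TAZ_HUMAN_FL output in this notebook, and the variable names are illustrative.

```python
# Residue counts copied from the TAZ_HUMAN_FL output in this notebook
counts = {
    "A": 15, "C": 6, "D": 9, "E": 18, "F": 16, "G": 22, "H": 16,
    "I": 14, "K": 19, "L": 32, "M": 10, "N": 12, "P": 21, "Q": 9,
    "R": 15, "S": 12, "T": 13, "V": 19, "W": 9, "Y": 5,
}
length = sum(counts.values())

# Aromaticity: fraction of residues that are F, W, or Y
aromaticity = (counts["F"] + counts["W"] + counts["Y"]) / length

# Fractions (not percentages) of histidine and aspartate
his_fraction = counts["H"] / length
asp_fraction = counts["D"] / length

summary = "%0.2f %0.2f %0.2f" % (aromaticity, his_fraction, asp_fraction)
print(summary)
```

Multiplying a fraction by 100 gives the percentage, so the histidine value of about 0.05 corresponds to roughly 5% of the residues.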
#create a for loop that loops over each sequence
for record in SeqIO.parse('TAZ.fasta', 'fasta'):
    #set local variable X that contains a ProteinAnalysis object for each individual sequence in the fasta file
    X = ProteinAnalysis(str(record.seq))
    #print the protein characteristics you're interested in
    print('\n### Results for record: {} ###'.format(record.id))
    print('Molecular Weight: ' + str(X.molecular_weight()))
    print('Average Hydropathy: ' + str(X.gravy()))
    print('Histidine Count: ' + str(X.count_amino_acids()['H']))
    print('Aspartate Count: ' + str(X.count_amino_acids()['D']))
    print('Amino Acid Count: ' + str(X.count_amino_acids()))
    #get_amino_acids_percent returns fractions between 0 and 1, not percentages
    print('Percent Histidine: ' + "%0.2f" % X.get_amino_acids_percent()['H'])
    print('Percent Aspartate: ' + "%0.2f" % X.get_amino_acids_percent()['D'])
    print('Secondary Structures (helix, turn, sheet): ' + str(X.secondary_structure_fraction()))
    print('Aromaticity: ' + "%0.2f" % X.aromaticity())
    print('Instability_index: ' + "%0.2f" % X.instability_index())
    print('Isoelectric_point: ' + "%0.2f" % X.isoelectric_point())
    #set local variable sec_struc to view secondary structures
    #sec_struc[0] is the helix fraction, the first value of the (helix, turn, sheet) tuple
    sec_struc = X.secondary_structure_fraction()
    print('Secondary Structure Fraction: ' + "%0.2f" % sec_struc[0])
    #set local variable epsilon_prot to view molar extinction coefficient
    #epsilon_prot[0] assumes reduced cysteines; epsilon_prot[1] assumes disulfide bridges
    epsilon_prot = X.molar_extinction_coefficient()
    print('Molar Extinction Coefficient: ' + str(epsilon_prot[0]))
### Results for record: TAZ_HUMAN_FL ###
Molecular Weight: 33458.62240000003
Average Hydropathy: -0.22705479452054808
Histidine Count: 16
Aspartate Count: 9
Amino Acid Count: {'A': 15, 'C': 6, 'D': 9, 'E': 18, 'F': 16, 'G': 22, 'H': 16, 'I': 14, 'K': 19, 'L': 32, 'M': 10, 'N': 12, 'P': 21, 'Q': 9, 'R': 15, 'S': 12, 'T': 13, 'V': 19, 'W': 9, 'Y': 5}
Percent Histidine: 0.05
Percent Aspartate: 0.03
Secondary Structures (helix, turn, sheet): (0.3253424657534246, 0.22945205479452052, 0.2568493150684931)
Aromaticity: 0.10
Instability_index: 48.77
Isoelectric_point: 9.10
Secondary Structure Fraction: 0.33
Molar Extinction Coefficient: 56950
### Results for record: sp|Q16635|TAZ_HUMAN_exon5 ###
Molecular Weight: 30203.041100000017
Average Hydropathy: -0.11297709923664125
Histidine Count: 15
Aspartate Count: 8
Amino Acid Count: {'A': 12, 'C': 6, 'D': 8, 'E': 14, 'F': 14, 'G': 16, 'H': 15, 'I': 14, 'K': 16, 'L': 31, 'M': 9, 'N': 11, 'P': 20, 'Q': 8, 'R': 12, 'S': 12, 'T': 12, 'V': 18, 'W': 9, 'Y': 5}
Percent Histidine: 0.06
Percent Aspartate: 0.03
Secondary Structures (helix, turn, sheet): (0.3473282442748092, 0.22519083969465647, 0.25190839694656486)
Aromaticity: 0.11
Instability_index: 50.21
Isoelectric_point: 9.01
Secondary Structure Fraction: 0.35
Molar Extinction Coefficient: 56950
### Results for record: sp|Q91WF0|TAZ_MOUSE ###
Molecular Weight: 30333.20920000002
Average Hydropathy: -0.12862595419847325
Histidine Count: 14
Aspartate Count: 8
Amino Acid Count: {'A': 11, 'C': 6, 'D': 8, 'E': 14, 'F': 15, 'G': 16, 'H': 14, 'I': 14, 'K': 16, 'L': 30, 'M': 10, 'N': 12, 'P': 19, 'Q': 8, 'R': 13, 'S': 12, 'T': 12, 'V': 18, 'W': 9, 'Y': 5}
Percent Histidine: 0.05
Percent Aspartate: 0.03
Secondary Structures (helix, turn, sheet): (0.34732824427480913, 0.22519083969465647, 0.2480916030534351)
Aromaticity: 0.11
Instability_index: 48.45
Isoelectric_point: 9.14
Secondary Structure Fraction: 0.35
Molar Extinction Coefficient: 56950
### Results for record: sp|F1QCP6|TAZ_DANRE ###
Molecular Weight: 30574.121900000027
Average Hydropathy: -0.27213740458015273
Histidine Count: 11
Aspartate Count: 11
Amino Acid Count: {'A': 7, 'C': 6, 'D': 11, 'E': 13, 'F': 12, 'G': 16, 'H': 11, 'I': 15, 'K': 9, 'L': 24, 'M': 12, 'N': 14, 'P': 17, 'Q': 11, 'R': 20, 'S': 13, 'T': 15, 'V': 21, 'W': 9, 'Y': 6}
Percent Histidine: 0.04
Percent Aspartate: 0.04
Secondary Structures (helix, turn, sheet): (0.3320610687022901, 0.22900763358778625, 0.21374045801526717)
Aromaticity: 0.10
Instability_index: 40.18
Isoelectric_point: 8.94
Secondary Structure Fraction: 0.33
Molar Extinction Coefficient: 58440
### Results for record: sp|Q9V6G5|TAZ_DROME ###
Molecular Weight: 43015.5505
Average Hydropathy: -0.27592592592592613
Histidine Count: 8
Aspartate Count: 21
Amino Acid Count: {'A': 20, 'C': 7, 'D': 21, 'E': 16, 'F': 14, 'G': 20, 'H': 8, 'I': 27, 'K': 22, 'L': 32, 'M': 11, 'N': 17, 'P': 37, 'Q': 14, 'R': 27, 'S': 22, 'T': 14, 'V': 30, 'W': 8, 'Y': 11}
Percent Histidine: 0.02
Percent Aspartate: 0.06
Secondary Structures (helix, turn, sheet): (0.32275132275132273, 0.25396825396825395, 0.20899470899470898)
Aromaticity: 0.09
Instability_index: 47.78
Isoelectric_point: 9.35
Secondary Structure Fraction: 0.32
Molar Extinction Coefficient: 60390
### Results for record: sp|Q06510|TAZ_YEAST ###
Molecular Weight: 44187.05960000001
Average Hydropathy: -0.47270341207349087
Histidine Count: 6
Aspartate Count: 23
Amino Acid Count: {'A': 17, 'C': 2, 'D': 23, 'E': 27, 'F': 23, 'G': 19, 'H': 6, 'I': 18, 'K': 26, 'L': 40, 'M': 9, 'N': 17, 'P': 27, 'Q': 7, 'R': 28, 'S': 31, 'T': 18, 'V': 21, 'W': 10, 'Y': 12}
Percent Histidine: 0.02
Percent Aspartate: 0.06
Secondary Structures (helix, turn, sheet): (0.3254593175853018, 0.24671916010498685, 0.2440944881889764)
Aromaticity: 0.12
Instability_index: 36.20
Isoelectric_point: 8.80
Secondary Structure Fraction: 0.33
Molar Extinction Coefficient: 72880
### Results for record: sp|Q6IV84|TAZ_PANTR ###
Molecular Weight: 30160.958000000013
Average Hydropathy: -0.10916030534351147
Histidine Count: 15
Aspartate Count: 8
Amino Acid Count: {'A': 12, 'C': 6, 'D': 8, 'E': 14, 'F': 14, 'G': 16, 'H': 15, 'I': 14, 'K': 16, 'L': 31, 'M': 9, 'N': 12, 'P': 20, 'Q': 8, 'R': 11, 'S': 12, 'T': 12, 'V': 18, 'W': 9, 'Y': 5}
Percent Histidine: 0.06
Percent Aspartate: 0.03
Secondary Structures (helix, turn, sheet): (0.3473282442748092, 0.22900763358778625, 0.25190839694656486)
Aromaticity: 0.11
Instability_index: 49.87
Isoelectric_point: 8.87
Secondary Structure Fraction: 0.35
Molar Extinction Coefficient: 56950
### Results for record: sp|Q92604|LGAT1_HUMAN ###
Molecular Weight: 43088.86390000004
Average Hydropathy: 0.0016216216216214767
Histidine Count: 9
Aspartate Count: 17
Amino Acid Count: {'A': 23, 'C': 4, 'D': 17, 'E': 17, 'F': 20, 'G': 22, 'H': 9, 'I': 26, 'K': 24, 'L': 42, 'M': 14, 'N': 13, 'P': 14, 'Q': 15, 'R': 17, 'S': 18, 'T': 19, 'V': 24, 'W': 14, 'Y': 18}
Percent Histidine: 0.02
Percent Aspartate: 0.05
Secondary Structures (helix, turn, sheet): (0.3891891891891892, 0.1810810810810811, 0.2594594594594595)
Aromaticity: 0.14
Instability_index: 41.04
Isoelectric_point: 9.02
Secondary Structure Fraction: 0.39
Molar Extinction Coefficient: 103820
### Results for record: sp|Q9NRZ5|PLCD_HUMAN ###
Molecular Weight: 44020.814600000085
Average Hydropathy: -0.009259259259259524
Histidine Count: 9
Aspartate Count: 15
Amino Acid Count: {'A': 18, 'C': 10, 'D': 15, 'E': 21, 'F': 25, 'G': 20, 'H': 9, 'I': 21, 'K': 24, 'L': 48, 'M': 8, 'N': 14, 'P': 15, 'Q': 11, 'R': 21, 'S': 25, 'T': 20, 'V': 25, 'W': 12, 'Y': 16}
Percent Histidine: 0.02
Percent Aspartate: 0.04
Secondary Structures (helix, turn, sheet): (0.38888888888888884, 0.19576719576719576, 0.2513227513227513)
Aromaticity: 0.14
Instability_index: 39.44
Isoelectric_point: 8.95
Secondary Structure Fraction: 0.39
Molar Extinction Coefficient: 89840
### Results for record: sp|Q9HCL2|GPAT1_HUMAN ###
Molecular Weight: 93793.6966000005
Average Hydropathy: -0.13719806763285072
Histidine Count: 24
Aspartate Count: 33
Amino Acid Count: {'A': 49, 'C': 18, 'D': 33, 'E': 57, 'F': 41, 'G': 41, 'H': 24, 'I': 52, 'K': 43, 'L': 97, 'M': 12, 'N': 34, 'P': 35, 'Q': 36, 'R': 49, 'S': 72, 'T': 44, 'V': 57, 'W': 9, 'Y': 25}
Percent Histidine: 0.03
Percent Aspartate: 0.04
Secondary Structures (helix, turn, sheet): (0.3393719806763285, 0.21980676328502416, 0.25966183574879226)
Aromaticity: 0.09
Instability_index: 55.38
Isoelectric_point: 7.81
Secondary Structure Fraction: 0.34
Molar Extinction Coefficient: 86750
### Results for record: sp|P40939|ECHA_HUMAN ###
Molecular Weight: 82998.68110000047
Average Hydropathy: -0.07785058977719528
Histidine Count: 9
Aspartate Count: 38
Amino Acid Count: {'A': 63, 'C': 13, 'D': 38, 'E': 46, 'F': 31, 'G': 68, 'H': 9, 'I': 46, 'K': 71, 'L': 75, 'M': 19, 'N': 17, 'P': 33, 'Q': 32, 'R': 33, 'S': 48, 'T': 38, 'V': 64, 'W': 1, 'Y': 18}
Percent Histidine: 0.01
Percent Aspartate: 0.05
Secondary Structures (helix, turn, sheet): (0.30799475753604194, 0.21756225425950193, 0.26605504587155965)
Aromaticity: 0.07
Instability_index: 32.81
Isoelectric_point: 9.16
Secondary Structure Fraction: 0.31
Molar Extinction Coefficient: 32320
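Several of the values above follow directly from the amino acid counts, so they can be cross-checked by hand. The sketch below recomputes the average hydropathy, aromaticity, and histidine/aspartate fractions for TAZ_HUMAN_FL from its printed counts. It assumes that the average hydropathy is the mean Kyte-Doolittle score over all residues and that aromaticity is the fraction of F, W, and Y residues; the names KYTE_DOOLITTLE, TAZ_HUMAN_FL_COUNTS, and summarize_counts are introduced here only for illustration and are not part of the original analysis.

```python
# Kyte-Doolittle hydropathy scale (assumed here to be the scale behind "Average Hydropathy").
KYTE_DOOLITTLE = {
    'A': 1.8, 'R': -4.5, 'N': -3.5, 'D': -3.5, 'C': 2.5,
    'Q': -3.5, 'E': -3.5, 'G': -0.4, 'H': -3.2, 'I': 4.5,
    'L': 3.8, 'K': -3.9, 'M': 1.9, 'F': 2.8, 'P': -1.6,
    'S': -0.8, 'T': -0.7, 'W': -0.9, 'Y': -1.3, 'V': 4.2,
}

# Amino acid counts copied from the TAZ_HUMAN_FL results above.
TAZ_HUMAN_FL_COUNTS = {
    'A': 15, 'C': 6, 'D': 9, 'E': 18, 'F': 16, 'G': 22, 'H': 16,
    'I': 14, 'K': 19, 'L': 32, 'M': 10, 'N': 12, 'P': 21, 'Q': 9,
    'R': 15, 'S': 12, 'T': 13, 'V': 19, 'W': 9, 'Y': 5,
}


def summarize_counts(counts):
    """Recompute sequence-level metrics from a dict of amino acid counts."""
    length = sum(counts.values())
    # Average hydropathy: count-weighted mean of the per-residue scores.
    gravy = sum(KYTE_DOOLITTLE[aa] * n for aa, n in counts.items()) / length
    # Aromaticity: fraction of aromatic residues (Phe, Trp, Tyr).
    aromaticity = (counts['F'] + counts['W'] + counts['Y']) / length
    return {
        'length': length,
        'gravy': gravy,
        'aromaticity': aromaticity,
        'his_fraction': counts['H'] / length,
        'asp_fraction': counts['D'] / length,
    }


summary = summarize_counts(TAZ_HUMAN_FL_COUNTS)
print('Length: {}'.format(summary['length']))
print('Average Hydropathy: {:.4f}'.format(summary['gravy']))
print('Aromaticity: {:.2f}'.format(summary['aromaticity']))
print('Histidine fraction: {:.2f}'.format(summary['his_fraction']))
print('Aspartate fraction: {:.2f}'.format(summary['asp_fraction']))
```

Under these assumptions the recomputed values should agree with the TAZ_HUMAN_FL results to the printed precision. The check also makes explicit that the "Percent" values are fractions of the sequence length rather than percentages.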
Multiple sequence alignment shows that there are conserved spans of amino acid residues in the tafazzin protein as well as in other mitochondrial acyltransferases. Structural bioinformatics shows that the acyltransferases all have a loop structure (positioned at the lower left corner in the displayed images). This loop structure may correspond to the cleft described by Hijikata et al. (2015), where the substrate (acyl chains) would enter to be positioned and transferred to the phospholipid. Additionally, the phylogenetic tree shows how closely the human tafazzin protein is related to its orthologs and to other acyltransferases. The long isoform of human tafazzin is least related to the fly and yeast tafazzin proteins. It is also least similar in sequence to the other acyltransferases that were analyzed in this code. However, it seems that tafazzin from zebrafish, chimps, mice, flies, and the exon 5 deletion of human tafazzin all share a common ancestor that had the human tafazzin long isoform. The mouse, chimp, and fly tafazzin share a common ancestor that may have had the zebrafish tafazzin. 3D protein measurement shows that the human exon 5 deletion, mouse, and chimp tafazzin orthologs have similar average hydropathy of around -0.1 (-0.11 to -0.13). The human full-length, zebrafish, and fly orthologs have roughly double that value (-0.23 to -0.28), meaning they are more hydrophilic overall. Interestingly, the yeast ortholog, at -0.47, has roughly four times the hydropathy of the human exon 5 deletion protein. Despite these differences, the orthologs have similar fractions of histidine residues (4 to 6%) and aspartate residues (3 to 4%), except for the fly and yeast orthologs, which have markedly fewer histidine residues (about 2%) and markedly more aspartate residues (about 6%) than the others. Despite the differences in residue composition, all orthologs and even the other acyltransferases have a similar composition of secondary structures (helix 0.31 to 0.39, turn 0.18 to 0.25, sheet 0.21 to 0.27). Additionally, the aromaticity (0.07 to 0.14), instability index (32.81 to 55.38), and isoelectric point (7.81 to 9.35) are broadly similar across homologs.
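The fold differences in hydropathy discussed above can be checked directly from the reported values. The sketch below divides each tafazzin ortholog's average hydropathy by that of the human exon 5 deletion protein; the choice of that protein as the baseline, and the names GRAVY, REFERENCE, and fold_vs_reference, are assumptions made for illustration only. A ratio of two hydropathy averages is a rough descriptive comparison, not a statistical test.

```python
# Average hydropathy values copied from the results above (tafazzin orthologs only).
GRAVY = {
    'TAZ_HUMAN_FL': -0.22705479452054808,
    'TAZ_HUMAN_exon5': -0.11297709923664125,
    'TAZ_MOUSE': -0.12862595419847325,
    'TAZ_PANTR': -0.10916030534351147,
    'TAZ_DANRE': -0.27213740458015273,
    'TAZ_DROME': -0.27592592592592613,
    'TAZ_YEAST': -0.47270341207349087,
}

# Baseline protein for the comparison (an illustrative choice).
REFERENCE = 'TAZ_HUMAN_exon5'


def fold_vs_reference(values, reference):
    """Return each value divided by the reference protein's value."""
    ref = values[reference]
    return {name: value / ref for name, value in values.items()}


folds = fold_vs_reference(GRAVY, REFERENCE)

# Print the orthologs from smallest to largest fold difference.
for name, fold in sorted(folds.items(), key=lambda item: item[1]):
    print('{:<16} hydropathy {:+.3f}  ({:.2f}x {})'.format(
        name, GRAVY[name], fold, REFERENCE))
```

Read this way, the human full-length, zebrafish, and fly values come out at roughly two to two and a half times the baseline, and the yeast value at a little over four times.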
Taken together, these results support my hypothesis that important functional domains of tafazzin, such as the acyltransferase domain, are highly conserved across species and similar to those of other phospholipid acyltransferases: the amino acid sequences, protein structures, phylogenetic analysis, and 3D protein measurements all show similarities and connections between tafazzin orthologs and related acyltransferases.